{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Tce3stUlHN0L"
      },
      "source": [
        "##### Copyright 2024 Google LLC."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "tuOe1ymfHZPu"
      },
      "outputs": [],
      "source": [
        "# @title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "FAhwDgtUGIEo"
      },
      "source": [
        "# Getting Started with Xinference and Gemma\n",
        "\n",
        "This notebook demonstrates how to use Xinference to load a Gemma 2 model and run inference utilizing the GPU provided by Google Colab.\n",
        "\n",
        "[**Gemma**](https://ai.google.dev/gemma) is a family of lightweight, state-of-the-art open language models from Google. Built from the same research and technology used to create the Gemini models, Gemma models are text-to-text, decoder-only large language models (LLMs) available in English, with open weights, pre-trained variants, and instruction-tuned variants.\n",
        "\n",
        "[**llama.cpp**](https://github.com/ggerganov/llama.cpp) is a C++ implementation of Meta AI's LLaMA and other large language model architectures, designed for efficient performance on local machines or within environments like Google Colab. It enables you to run large language models without needing extensive computational resources.\n",
        "\n",
        "To make working with llama.cpp more accessible,\n",
        "[**llama-cpp-python**](https://github.com/abetlen/llama-cpp-python) provides Python bindings for the C++ library. This allows you to enjoy the performance optimizations of `llama.cpp` while benefiting from the simplicity and flexibility of Python. With llama-cpp-python, you get a convenient API for loading models, generating text, and customizing inference parameters.\n",
        "\n",
        "[**Xorbits Inference (Xinference)**](https://inference.readthedocs.io/en/latest/) is an open-source platform to streamline the operation and integration of a wide array of AI models. With Xinference, you’re empowered to run inference using any open-source LLMs, embedding models, and multimodal models either in the cloud or on your own premises, and create robust AI-driven applications. You will be using different **Gemma 2** model variants in the GGUF format for this tutorial, but the code should be easily transferrable to all LLM chat models supported by Xinference.\n",
        "\n",
        "The latest complete list of supported models can be found in Xorbits Inference's [official GitHub page](https://github.com/xorbitsai/inference/blob/main/README.md).\n",
        "\n",
        "By the end of this notebook, you will learn how to:\n",
        "\n",
        "- **Install Xinference and its dependencies**: Set up Xinference along with `llama.cpp` to run Gemma models in the GGUF format.\n",
        "- **Start the Xinference local server**: Initialize the server to run models locally.\n",
        "- **Launch different Gemma 2 model variants**: Select quantization and model size parameters to launch various Gemma 2 models.\n",
        "- **Interact with the model using Xinference's Python client**: Use the client to communicate with the model and receive responses.\n",
        "\n",
        "<table align=\"left\">\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/Gemma/[Gemma_2]Using_with_Xinference.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
        "  </td>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "FaqZItBdeokU"
      },
      "source": [
        "## Setup\n",
        "\n",
        "### Select the Colab runtime\n",
        "To complete this tutorial, you'll need to have a Colab runtime with sufficient resources to run the Gemma model. In this case, you can use a T4 GPU:\n",
        "\n",
        "1. In the upper-right of the Colab window, select **▾ (Additional connection options)**.\n",
        "2. Select **Change runtime type**.\n",
        "3. Under **Hardware accelerator**, select **T4 GPU**.\n",
        "\n",
        "**Once you've completed these steps, you're ready to move on to the next section where you'll set up the environment for Xinference to work with llama.cpp.**\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "eFzlnU4gG_JL"
      },
      "source": [
        "### Install Xinference and dependencies"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "jJgiCGt88A6X"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "\u001b[?25l     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/40.7 kB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m40.7/40.7 kB\u001b[0m \u001b[31m3.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25h  Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m24.4/24.4 MB\u001b[0m \u001b[31m68.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m94.9/94.9 kB\u001b[0m \u001b[31m9.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.8/5.8 MB\u001b[0m \u001b[31m98.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.0/4.0 MB\u001b[0m \u001b[31m88.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m57.1/57.1 MB\u001b[0m \u001b[31m10.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m320.1/320.1 kB\u001b[0m \u001b[31m25.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m63.8/63.8 kB\u001b[0m \u001b[31m5.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m40.5/40.5 kB\u001b[0m \u001b[31m3.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m278.6/278.6 kB\u001b[0m \u001b[31m22.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m149.3/149.3 kB\u001b[0m \u001b[31m14.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m11.1/11.1 MB\u001b[0m \u001b[31m92.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m73.2/73.2 kB\u001b[0m \u001b[31m7.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.8/3.8 MB\u001b[0m \u001b[31m85.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m525.6/525.6 kB\u001b[0m \u001b[31m36.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m130.2/130.2 kB\u001b[0m \u001b[31m12.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25h  Building wheel for quantile-python (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m443.8/443.8 MB\u001b[0m \u001b[31m1.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K   \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m45.5/45.5 kB\u001b[0m \u001b[31m2.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hSelecting previously unselected package libonig5:amd64.\n",
            "(Reading database ... 123630 files and directories currently installed.)\n",
            "Preparing to unpack .../libonig5_6.9.7.1-2build1_amd64.deb ...\n",
            "Unpacking libonig5:amd64 (6.9.7.1-2build1) ...\n",
            "Selecting previously unselected package libjq1:amd64.\n",
            "Preparing to unpack .../libjq1_1.6-2.1ubuntu3_amd64.deb ...\n",
            "Unpacking libjq1:amd64 (1.6-2.1ubuntu3) ...\n",
            "Selecting previously unselected package jq.\n",
            "Preparing to unpack .../jq_1.6-2.1ubuntu3_amd64.deb ...\n",
            "Unpacking jq (1.6-2.1ubuntu3) ...\n",
            "Setting up libonig5:amd64 (6.9.7.1-2build1) ...\n",
            "Setting up libjq1:amd64 (1.6-2.1ubuntu3) ...\n",
            "Setting up jq (1.6-2.1ubuntu3) ...\n",
            "Processing triggers for man-db (2.10.2-1) ...\n",
            "Processing triggers for libc-bin (2.35-0ubuntu3.4) ...\n",
            "/sbin/ldconfig.real: /usr/local/lib/libtbb.so.12 is not a symbolic link\n",
            "\n",
            "/sbin/ldconfig.real: /usr/local/lib/libumf.so.0 is not a symbolic link\n",
            "\n",
            "/sbin/ldconfig.real: /usr/local/lib/libtbbmalloc.so.2 is not a symbolic link\n",
            "\n",
            "/sbin/ldconfig.real: /usr/local/lib/libur_loader.so.0 is not a symbolic link\n",
            "\n",
            "/sbin/ldconfig.real: /usr/local/lib/libtbbbind_2_5.so.3 is not a symbolic link\n",
            "\n",
            "/sbin/ldconfig.real: /usr/local/lib/libtcm_debug.so.1 is not a symbolic link\n",
            "\n",
            "/sbin/ldconfig.real: /usr/local/lib/libur_adapter_level_zero.so.0 is not a symbolic link\n",
            "\n",
            "/sbin/ldconfig.real: /usr/local/lib/libur_adapter_opencl.so.0 is not a symbolic link\n",
            "\n",
            "/sbin/ldconfig.real: /usr/local/lib/libtbbbind_2_0.so.3 is not a symbolic link\n",
            "\n",
            "/sbin/ldconfig.real: /usr/local/lib/libtbbmalloc_proxy.so.2 is not a symbolic link\n",
            "\n",
            "/sbin/ldconfig.real: /usr/local/lib/libtcm.so.1 is not a symbolic link\n",
            "\n",
            "/sbin/ldconfig.real: /usr/local/lib/libtbbbind.so.3 is not a symbolic link\n",
            "\n",
            "/sbin/ldconfig.real: /usr/local/lib/libhwloc.so.15 is not a symbolic link\n",
            "\n"
          ]
        }
      ],
      "source": [
        "# Install Xinference\n",
        "!pip install -q xinference\n",
        "\n",
        "# The llama-cpp-python library allows us to leverage GPUs\n",
        "!pip install llama-cpp-python==0.2.90 \\\n",
        "  -q -U --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122\n",
        "\n",
        "# jq is a powerful command-line JSON processor that's widely used for parsing,\n",
        "# filtering, and formatting JSON data.\n",
        "!apt-get install -qq jq"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2EACA0GYHm2o"
      },
      "source": [
        "## Start Local Server\n",
        "\n",
        "\n",
        "To start a local instance of Xinference, run `xinference` in the background via `nohup`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "5EM01Gq7IQ2y"
      },
      "outputs": [],
      "source": [
        "!nohup xinference-local  > xinference.log 2>&1 &"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "OErerslQH4lU"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Waiting for the Xinference server to start...\n",
            "Xinference server should be running now.\n",
            "2024-11-28 12:24:10,788 xinference.core.supervisor 1022 INFO     Xinference supervisor 127.0.0.1:44380 started\n",
            "2024-11-28 12:24:10,826 xinference.core.worker 1022 INFO     Starting metrics export server at 127.0.0.1:None\n",
            "2024-11-28 12:24:10,827 xinference.core.worker 1022 INFO     Checking metrics export server...\n",
            "2024-11-28 12:24:12,083 xinference.core.worker 1022 INFO     Metrics server is started at: http://127.0.0.1:33897\n",
            "2024-11-28 12:24:12,083 xinference.core.worker 1022 INFO     Purge cache directory: /root/.xinference/cache\n",
            "2024-11-28 12:24:12,084 xinference.core.worker 1022 INFO     Connected to supervisor as a fresh worker\n",
            "2024-11-28 12:24:12,097 xinference.core.worker 1022 INFO     Xinference worker 127.0.0.1:44380 started\n",
            "Task was destroyed but it is pending!\n",
            "task: <Task pending name='Task-5' coro=<ActorCallerThreadLocal._listen() running at /usr/local/lib/python3.10/dist-packages/xoscar/backends/core.py:79> wait_for=<Future cancelled>>\n",
            "Exception ignored in: <coroutine object ActorCallerThreadLocal._listen at 0x78c3c4b0bd10>\n",
            "Traceback (most recent call last):\n",
            "  File \"/usr/lib/python3.10/ast.py\", line 50, in parse\n",
            "    return compile(source, filename, mode, flags,\n",
            "RuntimeError: coroutine ignored GeneratorExit\n",
            "2024-11-28 12:24:18,615 xinference.api.restful_api 951 INFO     Starting Xinference at endpoint: http://127.0.0.1:9997\n",
            "2024-11-28 12:24:18,777 uvicorn.error 951 INFO     Uvicorn running on http://127.0.0.1:9997 (Press CTRL+C to quit)\n"
          ]
        }
      ],
      "source": [
        "import time\n",
        "print(\"Waiting for the Xinference server to start...\")\n",
        "time.sleep(30)  # Wait for 30 seconds\n",
        "print(\"Xinference server should be running now.\")\n",
        "\n",
        "# View server logs\n",
        "!cat xinference.log"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WXtUJSC3I3kh"
      },
      "source": [
        "Congrats! You now have Xinference running in Colab machine. The default host and ip is 127.0.0.1 and 9997 respectively.\n",
        "\n",
        "\n",
        "Once Xinference is running, there are multiple ways you can try it out: via the web UI, via cURL, via the command line, or via the Xinference’s Python client."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0mkyrGIHJekz"
      },
      "source": [
        "The command line tool is `xinference`. You can list the commands that can be used by running:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "yayFuLIgJhYX"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Usage: xinference [OPTIONS] COMMAND [ARGS]...\n",
            "\n",
            "  Xinference command-line interface for serving and deploying models.\n",
            "\n",
            "Options:\n",
            "  -v, --version       Show the current version of the Xinference tool.\n",
            "  --log-level TEXT    Set the logger level. Options listed from most log to\n",
            "                      least log are: DEBUG > INFO > WARNING > ERROR > CRITICAL\n",
            "                      (Default level is INFO)\n",
            "  -H, --host TEXT     Specify the host address for the Xinference server.\n",
            "  -p, --port INTEGER  Specify the port number for the Xinference server.\n",
            "  --help              Show this message and exit.\n",
            "\n",
            "Commands:\n",
            "  cached         List all cached models in Xinference.\n",
            "  cal-model-mem  calculate gpu mem usage with specified model size and...\n",
            "  chat           Chat with a running LLM.\n",
            "  engine         Query the applicable inference engine by model name.\n",
            "  generate       Generate text using a running LLM.\n",
            "  launch         Launch a model with the Xinference framework with the...\n",
            "  list           List all running models in Xinference.\n",
            "  login          Login when the cluster is authenticated.\n",
            "  register       Register a new model with Xinference for deployment.\n",
            "  registrations  List all registered models in Xinference.\n",
            "  remove-cache   Remove selected cached models in Xinference.\n",
            "  stop-cluster   Stop a cluster using the Xinference framework with the...\n",
            "  terminate      Terminate a deployed model through unique identifier...\n",
            "  unregister     Unregister a model from Xinference, removing it from...\n",
            "  vllm-models    Query and display models compatible with vLLM.\n"
          ]
        }
      ],
      "source": [
        "!xinference --help"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TKJkWhP3kMgP"
      },
      "source": [
        "You can launch different Gemma 2 [model variants](https://inference.readthedocs.io/en/latest/models/builtin/llm/gemma-2-it.html) using the following command. However, for ease of use, you'll mainly rely on the Python API to select the appropriate combinations of model size and quantization.\n",
        "\n",
        "```bash\n",
        "xinference launch \\\n",
        "    --model-engine ${engine} \\\n",
        "    --model-name ${name} \\\n",
        "    --size-in-billions ${model_size} \\\n",
        "    --model-format {format} \\\n",
        "    --quantization ${quantization}\n",
        "```\n",
        "\n",
        "The placeholders in the command can be replaced with the appropriate values:\n",
        "\n",
        "- **`${engine}`**: The model engine to use. Possible options are:\n",
        "  - `llama.cpp` (used with `--model-format ggufv2`)\n",
        "  - `transformers`\n",
        "  - `sglang`\n",
        "  - `vllm`\n",
        "\n",
        "- **`${name}`**: The model name, e.g., `gemma-2-it`.\n",
        "\n",
        "- **`${model_size}`**: The size of the model in billions of parameters. Options are `2`, `9`, or `27`.\n",
        "\n",
        "- **`${format}`**: The model format. Common options include:\n",
        "  - `ggufv2` (used with `llama.cpp` engine)\n",
        "  - `pytorch`\n",
        "  - `awq`\n",
        "  - `gptq`\n",
        "\n",
        "- **`${quantization}`**: The quantization method, which depends on the model format and size.\n",
        "\n",
        "  - When `--model-format` is `ggufv2` and using `llama.cpp`, valid quantizations for Gemma 2 are:\n",
        "\n",
        "    - **For 2B models**: `Q3_K_L`, `Q4_K_M`, `Q4_K_S`, `Q5_K_M`, `Q5_K_S`, `Q6_K`, `Q6_K_L`, `Q8_0`, `f32`.\n",
        "\n",
        "    - **For 9B and 27B models**: `Q2_K`, `Q2_K_L`, `Q3_K_L`, `Q3_K_M`, `Q3_K_S`, `Q4_K_L`, `Q4_K_M`, `Q4_K_S`, `Q5_K_L`, `Q5_K_M`, `Q5_K_S`, `Q6_K`, `Q6_K_L`, `Q8_0`, `f32`.\n",
        "\n",
        "  - When `--model-format` is `pytorch`, the quantization is `none`.\n",
        "\n",
        "  - When `--model-format` is `awq`, the quantization is `Int4`.\n",
        "\n",
        "  - When `--model-format` is `gptq`, the quantization can be `Int3`, `Int4`, or `Int8`.\n",
        "\n",
        "You can also specify the model's UID using the `--model-uid` or `-u` flag. If not specified, Xinference will generate it automatically, creating a new model instance with a unique ID.\n",
        "\n",
        "**Note**: To simplify the process and ensure valid parameter combinations, it's recommended to use the Python API provided in this notebook. This automatically handles the selection of appropriate model sizes and quantization methods. For this tutorial, you'll stick to using `llama.cpp` as the model engine to run different `GGUF` Gemma 2 models on a modest GPU like the **T4**."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wvhIEjcHKXc5"
      },
      "source": [
        "## Choose a Gemma 2 model\n",
        "Xinference supports a variety of LLMs. Learn more in https://inference.readthedocs.io/en/latest/models/builtin/.\n",
        "\n",
        "Let’s start by running a built-in model: `gemma-2-it`. You can find out more about the available Gemma models [here](https://inference.readthedocs.io/en/latest/models/builtin/llm/gemma-2-it.html).\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ZZXLEzNwal1K"
      },
      "outputs": [],
      "source": [
        "model_name = 'gemma-2-it'"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "5CAWH07obuos"
      },
      "outputs": [],
      "source": [
        "#@title **Select Model Size**\n",
        "\n",
        "#@markdown **Model Size (in billions):**\n",
        "#@markdown - **2**: A smaller model that requires less memory but may offer less performance.\n",
        "#@markdown - **9**: A medium-sized model that balances performance and resource usage.\n",
        "#@markdown - **27**: A large model that provides better performance but requires more memory and computation time.\n",
        "\n",
        "model_size_in_billions = \"27\" #@param [\"2\", \"9\", \"27\"] {type:\"string\"}\n",
        "\n",
        "# Convert model size to integer\n",
        "model_size_in_billions = int(model_size_in_billions)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "eNI6mRQ2b4Ei"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Allowed quantizations for 27B model: ['Q2_K', 'Q2_K_L', 'Q3_K_L', 'Q3_K_M', 'Q3_K_S', 'Q4_K_L', 'Q4_K_M', 'Q4_K_S', 'Q5_K_L', 'Q5_K_M', 'Q5_K_S', 'Q6_K', 'Q6_K_L', 'Q8_0', 'f32']\n"
          ]
        }
      ],
      "source": [
        "# Define allowed quantizations per model size\n",
        "allowed_quantizations = {\n",
        "    2: [\"Q3_K_L\", \"Q4_K_M\", \"Q4_K_S\", \"Q5_K_M\", \"Q5_K_S\", \"Q6_K\", \"Q6_K_L\", \"Q8_0\", \"f32\"],\n",
        "    9: [\"Q2_K\", \"Q2_K_L\", \"Q3_K_L\", \"Q3_K_M\", \"Q3_K_S\", \"Q4_K_L\", \"Q4_K_M\", \"Q4_K_S\", \"Q5_K_L\", \"Q5_K_M\", \"Q5_K_S\", \"Q6_K\", \"Q6_K_L\", \"Q8_0\", \"f32\"],\n",
        "    27: [\"Q2_K\", \"Q2_K_L\", \"Q3_K_L\", \"Q3_K_M\", \"Q3_K_S\", \"Q4_K_L\", \"Q4_K_M\", \"Q4_K_S\", \"Q5_K_L\", \"Q5_K_M\", \"Q5_K_S\", \"Q6_K\", \"Q6_K_L\", \"Q8_0\", \"f32\"]\n",
        "}\n",
        "#@markdown **Allowed Quantizations for Selected Model Size:**\n",
        "print(f\"Allowed quantizations for {model_size_in_billions}B model: {allowed_quantizations[model_size_in_billions]}\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "0gB9TzEtcos1"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "aef6cdaca1f14459b22b0d3d06e11c72",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Dropdown(description='Quantization:', options=('Q2_K', 'Q2_K_L', 'Q3_K_L', 'Q3_K_M', 'Q3_K_S', 'Q4_K_L', 'Q4_K…"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "#@title **Select Quantization**\n",
        "\n",
        "#@markdown **Quantization Method:**\n",
        "#@markdown - **Q2_K to Q8_0**: Lower quantization levels (e.g., Q2_K) reduce memory usage but may affect model accuracy.\n",
        "#@markdown - **Higher quantization levels** (e.g., Q8_0) preserve model performance but require more memory as it uses more bits per parameter.\n",
        "\n",
        "\n",
        "#@markdown **Note:** Larger models and higher quantization levels require more memory and computation time. Ensure that your Colab instance has sufficient resources. i.e. Choose the appropriate **GPU Runtime Type** (T4, L4, A100) for your model.\n",
        "import ipywidgets as widgets\n",
        "\n",
        "# Create empty dropdown widget for quantization\n",
        "quantization_dropdown = widgets.Dropdown(\n",
        "    options=allowed_quantizations[model_size_in_billions],\n",
        "    description='Quantization:',\n",
        "    disabled=False,\n",
        ")\n",
        "\n",
        "# Display the widget\n",
        "display(quantization_dropdown)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "cOfOQ25KUGdY"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Selected model size: 27B\n",
            "Selected quantization: Q2_K\n",
            "Model UID: gemma-2-27b-it-Q2_K\n"
          ]
        }
      ],
      "source": [
        "# Set the quantization and model name based on the selected size\n",
        "quantization = quantization_dropdown.value\n",
        "model_uid = f\"gemma-2-{model_size_in_billions}b-it-{quantization}\"\n",
        "\n",
        "print(f\"Selected model size: {model_size_in_billions}B\")\n",
        "print(f\"Selected quantization: {quantization}\")\n",
        "print(f\"Model UID: {model_uid}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Jus0PzJYq1P9"
      },
      "source": [
        "## Launch the Gemma Model with Xinference\n",
        "\n",
        "Now you'll use Xinference's Python client to launch the Gemma model with the selected parameters."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "o5jXGGEb-eLs"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Launching model 'gemma-2-it' with quantization 'Q2_K'...\n",
            "This may take several minutes as the model needs to be downloaded.\n",
            "Model launched successfully!\n"
          ]
        }
      ],
      "source": [
        "from xinference.client import RESTfulClient\n",
        "\n",
        "# Define the client to connect to Xinference\n",
        "port = 9997  # Default Xinference port\n",
        "client = RESTfulClient(f\"http://localhost:{port}\")\n",
        "\n",
        "# Launch the Gemma model with selected parameters\n",
        "try:\n",
        "    print(f\"Launching model '{model_name}' with quantization '{quantization}'...\")\n",
        "    print(\"This may take several minutes as the model needs to be downloaded.\")\n",
        "    model_uid = client.launch_model(\n",
        "        model_uid=model_uid,\n",
        "        model_engine=\"llama.cpp\",\n",
        "        model_name=model_name,\n",
        "        model_size_in_billions=model_size_in_billions,\n",
        "        model_format=\"ggufv2\",\n",
        "        quantization=quantization.lower(),\n",
        "    )\n",
        "    print(\"Model launched successfully!\")\n",
        "except Exception as e:\n",
        "    print(f\"An error occurred while launching the model: {e}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Q--ic56eNDyo"
      },
      "source": [
        "When you start a model for the first time, Xinference will download the model parameters from Hugging Face. This process might take a few minutes, depending on the size of the model weights. The model files are cached locally, so you won't need to redownload them for subsequent runs.\n",
        "\n",
        "**Note**: If your runtime crashes or runs out of memory (OOM), consider switching to a different runtime type (such as **L4** or **A100**), or choose a better balance between **quantization** and **model size**."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "pqpbFtEV7qR4"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "UID                  Type    Name        Format      Size (in billions)  Quantization\n",
            "-------------------  ------  ----------  --------  --------------------  --------------\n",
            "gemma-2-27b-it-Q2_K  LLM     gemma-2-it  ggufv2                      27  Q2_K\n",
            "\n"
          ]
        }
      ],
      "source": [
        "!xinference list"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "fT7KTcDTh-gc"
      },
      "source": [
        "After running `!xinference list`, you can see that the model with the correct UID is now available for use."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cfF-cCFlMCvE"
      },
      "source": [
        "## Interact with the model"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MYKNW0c-MONc"
      },
      "source": [
        "Congrats! You now have the model running by Xinference. Once the model is running, you can try it out either command line, via cURL, or via Xinference’s Python client:\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "rftddk_iJtUC"
      },
      "source": [
        "### Use a cURL request\n",
        "\n",
        "Let's quickly test the model using a sample prompt."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "4eQsq-0nJvzb"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{\n",
            "  \"id\": \"chatcmpl-42dbf0ef-2973-4a92-80db-87989c6b7603\",\n",
            "  \"object\": \"chat.completion\",\n",
            "  \"created\": 1732796971,\n",
            "  \"model\": \"/root/.cache/huggingface/hub/models--bartowski--gemma-2-27b-it-GGUF/blobs/a361b524be3e172f3535b010c440d352f7f3103eee903d1eb939eea21d40e359\",\n",
            "  \"choices\": [\n",
            "    {\n",
            "      \"index\": 0,\n",
            "      \"message\": {\n",
            "        \"role\": \"assistant\",\n",
            "        \"content\": \"The **blue whale** is the largest animal on Earth. \\n\\nIt can grow up to 100 feet long and weigh over 200 tons. That's about the size of a Boeing 737 airplane! 🐳 \\n\"\n",
            "      },\n",
            "      \"finish_reason\": \"stop\"\n",
            "    }\n",
            "  ],\n",
            "  \"usage\": {\n",
            "    \"prompt_tokens\": 16,\n",
            "    \"completion_tokens\": 53,\n",
            "    \"total_tokens\": 69\n",
            "  }\n",
            "}\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
            "                                 Dload  Upload   Total   Spent    Left  Speed\n",
            "\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0\r100   138    0     0  100   138      0    114  0:00:01  0:00:01 --:--:--   114\r100   138    0     0  100   138      0     62  0:00:02  0:00:02 --:--:--    62\r100   138    0     0  100   138      0     43  0:00:03  0:00:03 --:--:--    43\r100   138    0     0  100   138      0     32  0:00:04  0:00:04 --:--:--    32\r100   138    0     0  100   138      0     26  0:00:05  0:00:05 --:--:--    26\r100   138    0     0  100   138      0     22  0:00:06  0:00:06 --:--:--     0\r100   717  100   579  100   138     84     20  0:00:06  0:00:06 --:--:--   123\r100   717  100   579  100   138     84     20  0:00:06  0:00:06 --:--:--   157\n"
          ]
        }
      ],
      "source": [
        "%%bash -s \"$model_uid\"\n",
        "model_uid=$1\n",
        "\n",
        "# Construct the JSON data with variable expansion\n",
        "request_payload=$(cat <<EOF\n",
        "{\n",
        "  \"model\": \"$model_uid\",\n",
        "  \"messages\": [\n",
        "    {\n",
        "      \"role\": \"user\",\n",
        "      \"content\": \"What is the largest animal?\"\n",
        "    }\n",
        "  ]\n",
        "}\n",
        "EOF\n",
        ")\n",
        "\n",
        "# Make the POST request using the constructed JSON data\n",
        "curl -X POST \\\n",
        "  'http://127.0.0.1:9997/v1/chat/completions' \\\n",
        "  -H 'Accept: application/json' \\\n",
        "  -H 'Content-Type: application/json' \\\n",
        "  -d \"$request_payload\" | jq ."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "RJ_72F51XFZY"
      },
      "source": [
        "### Use Xinference's Python client"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ohZPPubkXKLl"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'id': 'chatcmpl-46ea8b7e-66fe-4aea-aa06-b05a54022221',\n",
              " 'object': 'chat.completion',\n",
              " 'created': 1732796978,\n",
              " 'model': '/root/.cache/huggingface/hub/models--bartowski--gemma-2-27b-it-GGUF/blobs/a361b524be3e172f3535b010c440d352f7f3103eee903d1eb939eea21d40e359',\n",
              " 'choices': [{'index': 0,\n",
              "   'message': {'role': 'assistant',\n",
              "    'content': 'I am Gemma, a large language model created by the Gemma team at Google DeepMind. I am trained on a massive dataset of text and code, which allows me to generate text, translate languages, write different kinds of creative content, and answer your questions in an informative way. I am still under development, but I have learned to perform many kinds of tasks.'},\n",
              "   'finish_reason': 'stop'}],\n",
              " 'usage': {'prompt_tokens': 14, 'completion_tokens': 73, 'total_tokens': 87}}"
            ]
          },
          "execution_count": 14,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "#@title **Chat with the Model**\n",
        "#@markdown Enter your message below to chat with the model.\n",
        "\n",
        "query = \"Who are you?\" #@param {type:\"string\"}\n",
        "\n",
        "# Send a chat message\n",
        "model = client.get_model(model_uid)\n",
        "model.chat(messages=[\n",
        "    {\n",
        "        \"role\": \"user\",\n",
        "        \"content\": query\n",
        "    }\n",
        "])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cBFVSQIVVJaS"
      },
      "source": [
        "Congratulations on completing this tutorial! You've learned how to set up **Xinference** and **llama.cpp** to run different Gemma 2 models in the GGUF format and also interact with these models using Xinference's built-in Python client.\n",
        "\n",
        "## Next Steps\n",
        "\n",
        "For an even better user experience, consider exploring the following:\n",
        "\n",
        "- **Xinference Documentation**:\n",
        "  * [Custom Models](https://inference.readthedocs.io/en/latest/models/custom.html)\n",
        "  * [Deployment Docs](https://inference.readthedocs.io/en/latest/getting_started/using_xinference.html)\n",
        "  * [Examples and Tutorials](https://inference.readthedocs.io/en/latest/examples/index.html)\n",
        "- **Integrate Retrieval Augmented Generation (RAG)**: Improve the model's responses by incorporating a retrieval mechanism that fetches relevant information from external knowledge bases or documents, enhancing accuracy and context.\n",
        "- **Utilize External APIs**: Expand the model's capabilities by connecting to external APIs for real-time data and services, enabling dynamic and up-to-date responses.\n",
        "- **Enhance Output Formatting**: Modify the output display to mimic a chat interface for a more user-friendly interaction.\n",
        "\n",
        "Enjoy experimenting with Gemma models!"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "name": "[Gemma_2]Using_with_Xinference.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
